Web-Based Lemmatisation of Named Entities
Abstract
Identifying the lemma of a Named Entity is important for many Natural Language Processing applications such as Information Retrieval. Here we introduce a novel approach to Named Entity lemmatisation which utilises the occurrence frequencies of each possible lemma. We constructed four corpora in English and Hungarian and used them to train machine learning methods that yield simple decision rules based on the web frequencies of the lemmas. In experiments, our web-based heuristic achieved an average accuracy of nearly 91%.
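The idea of choosing a lemma by comparing how often each candidate occurs can be illustrated with a minimal sketch. It is not the decision rules learned in the paper: the suffix list, the `min_ratio` threshold, the function names, and the toy counts below are all assumptions made for illustration, and a plain dictionary stands in for real web hit counts.

```python
from typing import Dict, List

def candidate_lemmas(form: str, suffixes: List[str]) -> List[str]:
    """Return the surface form plus each form obtained by stripping a known suffix."""
    candidates = [form]
    for suffix in suffixes:
        if form.endswith(suffix) and len(form) > len(suffix):
            candidates.append(form[: -len(suffix)])
    return candidates

def pick_lemma(form: str, suffixes: List[str],
               counts: Dict[str, int], min_ratio: float = 0.1) -> str:
    """Pick the shortest candidate that is frequent enough relative to the surface form.

    `counts` is a hypothetical stand-in for web frequencies;
    `min_ratio` is an illustrative threshold, not a value from the paper.
    """
    surface_count = counts.get(form, 0)
    for candidate in sorted(candidate_lemmas(form, suffixes), key=len):
        count = counts.get(candidate, 0)
        if count > 0 and count >= min_ratio * surface_count:
            return candidate
    # No candidate has any evidence: keep the form as it was seen.
    return form

# Toy counts, invented for this example only.
toy_counts = {
    "Obama's": 1_000, "Obama": 50_000,
    "McDonald's": 90_000, "McDonald": 2_000,
}

print(pick_lemma("Obama's", ["'s"], toy_counts))
print(pick_lemma("McDonald's", ["'s"], toy_counts))
```

With these toy counts the possessive ending is stripped from "Obama's" because the shorter form is far more frequent, while "McDonald's" is kept whole because the stripped form is comparatively rare. In a real setting the threshold would be learned from annotated data rather than fixed by hand.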
Similar resources
Named Entity Matching Method Based on the Context-Free Morphological Generator
Polish named entities are mostly out-of-vocabulary words, i.e. they are not described in morphological lexicons, and their proper analysis by Polish morphological analysers is difficult. The existing approaches to guessing unknown word lemmas and descriptions do not provide results at a satisfactory level. Moreover, lemmatisation of multiword named entities cannot be solved by word-by-word lemmat...
The First Cross-Lingual Challenge on Recognition, Normalization, and Matching of Named Entities in Slavic Languages
This paper describes the outcomes of the First Multilingual Named Entity Challenge in Slavic Languages. The Challenge targets recognizing mentions of named entities in web documents, their normalization/lemmatization, and cross-lingual matching. The Challenge was organized in the context of the 6th Balto-Slavic Natural Language Processing Workshop, colocated with the EACL-2017 conference. Eleve...
Lemmatization of Multi-word Common Noun Phrases and Named Entities in Polish
In the paper we present a tool for lemmatization of multi-word common noun phrases and named entities for Polish called PoLem. The tool is based on a set of manually crafted rules and heuristics utilizing a set of dictionaries (including morphological, named entities and inflection patterns). The accuracy of lemmatization obtained by the tool reached 97.99% on a dataset with multi-word common ...
Extraction and analysis of proper nouns in Slovak texts
Unknown named entity recognition in inflected languages faces several specific problems. The first and foremost is that the entities themselves are inflected (Dvonč et al., 1966), leading to the problem of identifying word forms as belonging to the same lexeme, and also the problem of finding the correct lemma. In this article we analyse the distribution of word forms for proper nouns in Slovak and ...
A Novel Approach to Conditional Random Field-based Named Entity Recognition using Persian Specific Features
Named Entity Recognition is an information extraction technique that identifies named entities in a text. Three methods have conventionally been used to extract named entities from a text: rule-based, machine-learning-based, and a hybrid of the two. Machine-learning-based methods perform well in the Persian language if they are trained with good features. To get good performanc...